student <- read.table("student-mat.csv", sep = ";", header = TRUE)
student <- as_tibble(student)
student_all <- read.table("student-mat.csv", sep = ";", header = TRUE)
student_all <- as_tibble(student_all)
In 2006, Portugal had an early school leaving rate of nearly 40% among 18-24 year olds, compared with an EU average of 15% (Eurostat 2007). Various analyses and interventions have therefore targeted school success in Portuguese adolescents in an attempt to improve education overall, with mathematics performance a particular area of weakness.
We therefore analyze student data from two Portuguese secondary schools. The dataset records the mathematics course grade as the outcome, along with 30 predictors ranging from demographic factors and family situation to student habits and extracurricular involvement. Data were collected from 395 students aged 15-22 years.
Portuguese schools use a five-level grade classification. With the final score measured on a 0-20 scale, the categories are: Excellent (16-20), Good (14-15), Satisfactory (12-13), Sufficient (10-11), and Fail (0-9). For the purposes of this analysis, however, we model the exact score rather than the score category.
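As a quick illustration of the classification scheme (not used in the analysis below), the 0-20 score could be mapped to the five categories with `cut()`; the helper name `grade_category` is our own:

```r
# Hypothetical helper: map a 0-20 final score to the five-level
# Portuguese classification described above.
grade_category <- function(score) {
  cut(score,
      breaks = c(-1, 9, 11, 13, 15, 20),
      labels = c("Fail", "Sufficient", "Satisfactory", "Good", "Excellent"))
}

grade_category(c(8, 10, 12, 14, 18))
```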
In this analysis, we wish to predict student success in secondary school math as a result of two types of factors:
The potential outcomes of this analysis are:
The initial covariates in the dataset are:
And the response variables are:
For this analysis, we will be considering the G3 final grade as our response.
student %>%
ggplot(aes(G3)) + geom_histogram(binwidth = 1, color='black', fill = 'grey') +
geom_vline(xintercept = mean(student$G3), lwd = 1, color = 'red') +
labs(title = "Distribution of Final Math Grade", x="Final Grade", y="Count")
student %>% select(G1, G2, G3) %>% filter(G3 ==0 & G1 ==0 & G2 ==0)
## # A tibble: 0 x 3
## # … with 3 variables: G1 <int>, G2 <int>, G3 <int>
Let’s remove the students with a final grade of 0; a 0 makes little practical sense here, especially since all of these students earned grades above 0 in the first and/or second period.
**BB: agree with this approach; if we are feeling adventurous we could treat those 0’s as missing and impute them, but let’s not overcomplicate things yet**
# MICE code if we decide to impute
# student <- read.table("student-mat.csv", sep=";", header=T)
# student <- as.data.frame(student)
# student$G3[student$G3 == 0] <- NA
# md.pattern(as.data.frame(student)) # 38 students missing G3, almost 10% of data is missing
# imputed_data <- mice(student, m=5, maxit = 50, method = "pmm", seed = 500)
# summary(imputed_data)
# imputed_data$imp$G3
# imputed <- complete(imputed_data)
# Note: this throws a warning about non-numeric variables but still produces output for all numeric columns. It also gives an additional reason to model G3 only, since G1/G2 are highly correlated with G3.
ggcorr(student)
student <- student %>% filter(G3 != 0) %>% select(-G1, -G2)
student_all <- student_all %>% select(-G1, -G2)
student %>%
ggplot(aes(G3)) + geom_histogram(binwidth = 1, color='black', fill = 'grey') +
geom_vline(xintercept = mean(student$G3), lwd = 1, color = 'red') +
labs(title = "Distribution of Final Math Grade", x="Final Grade", y="Count")
#BB: I don't see any real differences
ggcorr(student)
student %>%
gather(key = "Variable", value = "Value", -G3) %>%
ggplot(aes(Value, G3)) + facet_wrap(~ Variable, scales = "free") +
geom_boxplot()
## visualizing absences
student %>%
# filter(absences < 23) %>%
mutate(absences = factor(absences)) %>%
ggplot(aes(absences, G3)) +
geom_boxplot()
ggplot(student, aes(absences, G3)) +
geom_jitter() +
geom_smooth()
Since the variable “absences” has a much greater range of values than most of the other variables, and since the data become extremely sparse beyond 22 absences, we decided to pool 23-93 absences into the single value 23.
## Pooling all absences greater than 22 into the 23 group
student <- student %>%
  mutate(absences = case_when(
    absences > 22 ~ 23L,
    TRUE ~ absences))
## visualizing new absences variable
student %>%
# filter(absences < 23) %>%
mutate(absences = factor(absences)) %>%
ggplot(aes(absences, G3)) +
geom_boxplot()
ggplot(student, aes(absences, G3)) +
geom_jitter() +
geom_smooth()
“Absences” still has a much greater spread of values than the other variables, so we will try binning it to see how that affects our models.
## Binning absences into 5 groups of 5 absence values each (except the last group, which includes all absences greater than or equal to 20).
# Binning
student <- student %>%
  mutate(absences = case_when(
    absences < 5 ~ 1L,
    absences >= 5 & absences < 10 ~ 2L,
    absences >= 10 & absences < 15 ~ 3L,
    absences >= 15 & absences < 20 ~ 4L,
    absences >= 20 ~ 5L,
    TRUE ~ absences))
student_all <- student_all %>%
  mutate(absences = case_when(
    absences < 5 ~ 1L,
    absences >= 5 & absences < 10 ~ 2L,
    absences >= 10 & absences < 15 ~ 3L,
    absences >= 15 & absences < 20 ~ 4L,
    absences >= 20 ~ 5L,
    TRUE ~ absences))
## visualizing new absences variable
student %>%
# filter(absences < 23) %>%
mutate(absences = factor(absences)) %>%
ggplot(aes(absences, G3)) +
geom_boxplot()
ggplot(student, aes(absences, G3)) +
geom_jitter() +
geom_smooth()
## visualizing alcohol consumption
# weekend alcohol consumption (5 is highest)
student %>%
mutate(Walc = factor(Walc)) %>%
ggplot(aes(Walc, G3)) +
geom_violin() +
geom_jitter(width = 0.2, alpha = 0.5)
# workday alcohol consumption (5 is highest)
student %>%
mutate(Dalc = factor(Dalc)) %>%
ggplot(aes(Dalc, G3)) +
geom_violin() +
geom_jitter(width = 0.2, alpha = 0.5)
## Time spent studying
student %>%
mutate(studytime = factor(studytime)) %>%
ggplot(aes(studytime, G3)) +
geom_violin() +
geom_jitter(width = 0.2, alpha = 0.5)
## Time spent travelling
student %>%
mutate(traveltime = factor(traveltime)) %>%
ggplot(aes(traveltime, G3)) +
geom_violin() +
geom_jitter(width = 0.2, alpha = 0.5)
## Guardian
student %>%
mutate(guardian = factor(guardian)) %>%
ggplot(aes(guardian, G3)) +
geom_violin() +
geom_jitter(width = 0.2, alpha = 0.5)
Based on these plots, we decided to remove the variable “guardian” from our dataset since it didn’t seem to have any influence on the outcome.
We also observed that students in this dataset were from two different schools. Using school in our prediction of math course success, although it may increase prediction accuracy, would make our analysis less generalizable to the overall population of students. Since the overarching goal of this analysis is to identify factors that systematically affect secondary school student success in math courses, we decided to remove school as a predictor variable.
# recenter age so the intercept corresponds to the youngest students
student$age <- student$age - min(student$age)
student <- student %>%
select(-guardian) %>%
select(-school)
student_all <- student_all %>%
select(-guardian) %>%
select(-school)
To get an initial sense of influential variables, we can consider all covariates retained so far and compare fit statistics. The code below performs best-subset selection: for each model size up to nvmax, it searches the possible combinations of covariates and reports the best model by the fit criteria. For example, with nvmax of 30 it produces the best model with 1 covariate, with 2 covariates, and so on up to the full model.
#Based on results the best models fall within 5-20 vars, so I only considered 20 variables in this run
best_subset <- leaps::regsubsets(G3 ~ ., student, nvmax = 20, method='exhaustive')
results<-summary(best_subset)
tibble(predictors = 1:20,
adj_R2 = results$adjr2,
Cp = results$cp,
BIC = results$bic) %>%
gather(statistic, value, -predictors) %>%
ggplot(aes(predictors, value, color = statistic)) +
geom_line(show.legend = F) +
geom_point(show.legend = F) +
facet_wrap(~ statistic, scales = "free")
#which.max(results$adjr2)
# 17 variables
#which.min(results$bic)
# 8 variables
#which.min(results$cp)
# 14 variables
coef(best_subset, 8)
## (Intercept) Mjobhealth Mjobservices Fjobteacher failures
## 14.6123286 1.7919419 1.4684797 2.0295712 -1.1000624
## schoolsupyes goout health absences
## -2.3415181 -0.4729677 -0.2581331 -0.4284993
coef(best_subset, 7)
## (Intercept) Mjobhealth Mjobservices Fjobteacher failures
## 13.6819891 1.6852313 1.3837170 2.0288848 -1.1282540
## schoolsupyes goout absences
## -2.3052617 -0.4655962 -0.4141869
# We lose health when we drop to 7 variables. Otherwise all variables remain the same and are very similar in their effect size.
VIF and Tolerance
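For reference (standard definitions, not computed explicitly in the code below, which relies on car::vif), the VIF of covariate \(j\) is obtained from the \(R^2\) of regressing \(X_j\) on the remaining covariates:

\[VIF_j = \frac{1}{1 - R_j^2}, \qquad Tolerance_j = \frac{1}{VIF_j}\]

A common rule of thumb flags VIF above 10 (tolerance below 0.1) as problematic collinearity.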
covariates_int <- model.matrix(G3 ~ ., data = student )
covariates <- covariates_int %>% as.data.frame() %>% select(-`(Intercept)`)
student_d <- data.frame(G3 = student$G3, covariates)
model1 <- lm(G3 ~ ., data = student_d)
VIF = car::vif(model1)
Tol = 1/VIF
df = data.frame(Variable = names(VIF), VIF = VIF, Tolerance = Tol)
df %>% arrange(desc(VIF)) %>% head(10) %>% knitr::kable(align = c("c", "c"))
| Variable | VIF | Tolerance |
|---|---|---|
| Fjobother | 6.342636 | 0.1576631 |
| Fjobservices | 5.576032 | 0.1793390 |
| Mjobteacher | 3.294431 | 0.3035425 |
| Mjobservices | 3.086430 | 0.3239989 |
| Mjobother | 2.919526 | 0.3425214 |
| Medu | 2.799083 | 0.3572598 |
| Fjobteacher | 2.766049 | 0.3615265 |
| Walc | 2.443405 | 0.4092649 |
| Mjobhealth | 2.423100 | 0.4126945 |
| Fjobhealth | 2.275655 | 0.4394340 |
#set up dataframe for all students dataset:
covariates_int_all <- model.matrix(G3 ~ ., data = student_all )
covariates_all <- covariates_int_all %>% as.data.frame() %>% select(-`(Intercept)`)
student_d_all <- data.frame(G3 = student_all$G3, covariates_all)
Displayed above are the covariates with the highest Variance Inflation Factors.
EigenAnalysis for the Correlation and Scaled SSCP Matrices
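The condition indices computed below compare each eigenvalue of the matrix to the largest one:

\[CI_k = \sqrt{\lambda_{\max} / \lambda_k}\]

with values above roughly 30 conventionally taken to indicate harmful collinearity.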
xtx = t(covariates_int)%*%covariates_int
Ds_half <- diag(diag(xtx)^-0.5)
sscp <- Ds_half %*% xtx %*% Ds_half
eig_sscp <- eigen(sscp)$values
PCs_sscp <- prcomp(sscp)[2]
CI_sscp <- sqrt(eig_sscp[1]/eig_sscp)
cov_center <- apply(covariates, 2, function(y) y - mean(y))
C <- (t(cov_center)%*%cov_center)/dim(cov_center)[1]
Dc_half <- diag(diag(C)^-0.5)
R <- Dc_half %*% C %*% Dc_half
eig_corr <- eigen(R)$values
CI_corr <- sqrt(eig_corr[1]/eig_corr)
PCs_corr <- prcomp(R)[2]
df <- data.frame("Eigenvalue" = c(eig_corr), "Condition Index" = c(CI_corr))
df2 <- data.frame("Eigenvalue" = c(eig_sscp), "Condition Index" = c(CI_sscp))
df %>% tail(2) %>% knitr::kable(align = c("c", "c"))
| Eigenvalue | Condition.Index | |
|---|---|---|
| 35 | 0.1042497 | 5.546255 |
| 36 | 0.0685997 | 6.837166 |
df2 %>% filter(`Condition.Index` > 30) %>% knitr::kable(align = c("c", "c"))
| Eigenvalue | Condition.Index |
|---|---|
| 0.0063958 | 57.15686 |
From the correlation matrix, the highest condition index is still very small, which means no collinearity needs to be addressed there. From the scaled SSCP matrix, however, some condition indices exceed 30.
PCs_sscp$rotation[1:17,36:37]
## PC36 PC37
## [1,] -0.54238432 -0.768156681
## [2,] 0.01500628 0.001119414
## [3,] -0.01588949 0.052591238
## [4,] 0.04850298 0.010901229
## [5,] -0.01501923 0.024552007
## [6,] 0.06871137 0.031398323
## [7,] 0.36810775 -0.223722976
## [8,] -0.02014247 0.047035204
## [9,] -0.15226341 0.102934232
## [10,] -0.18494245 0.141719926
## [11,] -0.19775873 0.138712109
## [12,] -0.20887103 0.139184678
## [13,] -0.11254029 0.125983102
## [14,] -0.34217703 0.393506844
## [15,] -0.22943971 0.277311183
## [16,] -0.14535872 0.157940062
## [17,] -0.02812198 0.031426805
The output above shows the last two PCs of the scaled SSCP matrix for the first 17 covariates (the remaining loadings are close to 0). Several covariates load non-trivially alongside the intercept and so may be collinear with it. These covariates are listed below.
colnames(student_d)[c(7, 10:16)]
## [1] "Medu" "Mjobother" "Mjobservices" "Mjobteacher"
## [5] "Fjobhealth" "Fjobother" "Fjobservices" "Fjobteacher"
Many of these are factor variables whose reference levels are absorbed into the intercept, so it makes sense that they might be collinear with it. We will not remove any of them.
There are many types of models to consider when building a regression model, such as multiple linear regression, ANOVA, linear mixed models, and logistic/Poisson regression. Classically, linear mixed models are used when observations are related, and logistic/Poisson regression is used for binary or count outcomes. Since we assume our observations (the students) are independent and the outcome variable (final grade) is numeric, we narrow our model search down to multiple linear regression and ANOVA. Since, at least to start, we have many factor-type covariates, we begin by exploring multiple linear regression models and their fit to our data.
*BB: since categorical vars are being grouped into several vars we have more than 30 Betas. I count 39 below.
The first model we consider is a full model of all covariates possible and no interaction terms included. The model can be written as follows:
\[y = \beta_0 + \beta_1X_1 + ... + \beta_{36}X_{36} + \epsilon\] where \(\beta_i\) is the regression coefficient associated with covariate \(X_i\). There are more covariates than expected because each factor variable is recoded as a series of “dummy” variables: a factor with \(n\) levels contributes \(n-1\) dummy variables, and therefore \(n-1\) parameters.
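As a small sketch of this dummy-variable expansion (the factor f below is hypothetical, not from the dataset), model.matrix() turns an n-level factor into n - 1 indicator columns plus the intercept:

```r
# A 3-level factor yields 2 dummy columns (plus the intercept);
# the first level ("health") is absorbed into the intercept.
f <- factor(c("health", "other", "teacher"))
model.matrix(~ f)
```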
summary(model1)
##
## Call:
## lm(formula = G3 ~ ., data = student_d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.5064 -1.8720 0.0293 1.8227 6.9600
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.61184 1.74233 7.238 3.39e-12 ***
## sexM 0.57764 0.35502 1.627 0.10471
## age -0.18458 0.14030 -1.316 0.18923
## addressU 0.48945 0.40361 1.213 0.22615
## famsizeLE3 0.24556 0.34389 0.714 0.47571
## PstatusT -0.24239 0.50396 -0.481 0.63087
## Medu 0.17054 0.22593 0.755 0.45090
## Fedu 0.13734 0.19477 0.705 0.48125
## Mjobhealth 1.14310 0.80393 1.422 0.15603
## Mjobother -0.58096 0.52655 -1.103 0.27071
## Mjobservices 0.84323 0.58848 1.433 0.15287
## Mjobteacher -0.71277 0.74734 -0.954 0.34094
## Fjobhealth -0.63368 1.01710 -0.623 0.53371
## Fjobother -0.63043 0.74669 -0.844 0.39913
## Fjobservices -0.79711 0.77579 -1.027 0.30497
## Fjobteacher 1.18464 0.94423 1.255 0.21053
## reasonhome 0.26472 0.39912 0.663 0.50765
## reasonother -0.09380 0.57353 -0.164 0.87019
## reasonreputation 0.13815 0.41101 0.336 0.73699
## traveltime 0.02845 0.23921 0.119 0.90542
## studytime 0.51548 0.20779 2.481 0.01362 *
## failures -0.82301 0.25773 -3.193 0.00155 **
## schoolsupyes -2.42939 0.46632 -5.210 3.40e-07 ***
## famsupyes -0.63394 0.33989 -1.865 0.06308 .
## paidyes -0.44278 0.33601 -1.318 0.18852
## activitiesyes 0.12753 0.32141 0.397 0.69179
## nurseryyes -0.27189 0.39819 -0.683 0.49522
## higheryes 0.39340 0.85823 0.458 0.64699
## internetyes 0.61575 0.44649 1.379 0.16883
## romanticyes -0.09634 0.34193 -0.282 0.77832
## famrel 0.12137 0.17748 0.684 0.49457
## freetime 0.10646 0.16699 0.638 0.52424
## goout -0.45766 0.16638 -2.751 0.00628 **
## Dalc 0.05727 0.23179 0.247 0.80502
## Walc -0.15349 0.17833 -0.861 0.39005
## health -0.23289 0.11382 -2.046 0.04156 *
## absences -0.36947 0.14796 -2.497 0.01303 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.787 on 320 degrees of freedom
## Multiple R-squared: 0.3296, Adjusted R-squared: 0.2542
## F-statistic: 4.371 on 36 and 320 DF, p-value: 2.751e-13
** ADJUSTED R2 of 0.254 BOO **
To assess the validity of this model, we must check the assumptions of linear regression, including homogeneity of variance and Gaussian errors.
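The studentized residuals used below scale each raw residual by its estimated standard deviation with observation \(i\) left out:

\[r_i = \frac{e_i}{s_{(i)}\sqrt{1 - h_{ii}}}\]

where \(h_{ii}\) is the \(i\)th leverage and \(s_{(i)}\) is the residual standard error from the fit excluding observation \(i\); values beyond about \(\pm 2\) merit a closer look.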
stud_resid = rstudent(model1)
dat_resid <- data.frame(RESID = stud_resid, ABS = abs(stud_resid), PRED = model1$fitted.values)
dat_resid %>% arrange(-ABS) %>% filter(ABS > 2)
## RESID ABS PRED
## 1 2.698348 2.698348 12.039974
## 2 2.544754 2.544754 10.371589
## 3 2.516308 2.516308 11.339381
## 4 -2.454889 2.454889 11.506401
## 5 2.421202 2.421202 11.522224
## 6 2.419423 2.419423 11.716532
## 7 2.398537 2.398537 12.654729
## 8 -2.324983 2.324983 12.230358
## 9 2.276336 2.276336 12.075219
## 10 2.260274 2.260274 13.063600
## 11 -2.241573 2.241573 11.990685
## 12 2.203381 2.203381 8.205372
## 13 -2.125594 2.125594 11.673253
## 14 2.118535 2.118535 12.430768
## 15 2.057514 2.057514 12.684111
## 16 -2.048861 2.048861 11.302894
## 17 -2.026352 2.026352 12.250541
## 18 -2.022032 2.022032 11.184054
The observations above all have studentized residuals greater than 2 in absolute value, so it may be of interest to examine them in further detail. ** this part might be better in the outlier detection section **
nortest::ad.test(dat_resid$RESID)
##
## Anderson-Darling normality test
##
## data: dat_resid$RESID
## A = 0.29059, p-value = 0.6087
From the output above, since the p-value of 0.6087 is greater than \(\alpha = 0.05\), we fail to reject the null hypothesis that the residuals are normally distributed.
p1 <- ggplot(data = dat_resid) + geom_histogram(aes(RESID), bins = 25) + labs(x = "Studentized Residuals")
p2 <- ggplot(data = dat_resid) + geom_point(aes(PRED, RESID)) + geom_hline(aes(yintercept = 0)) + labs(x = "Predicted Values", y = "Studentized Residuals")
cowplot::plot_grid(p1, p2)
Since, from the collinearity analysis, several variables were significantly collinear with the intercept, we next fit a model identical to the full model but with the intercept removed.
model2 <- lm(G3 ~ . -1, data = student_d)
summary(model2)
##
## Call:
## lm(formula = G3 ~ . - 1, data = student_d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.5506 -1.8695 0.1517 1.8989 8.0781
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## sexM 0.754517 0.381480 1.978 0.04880 *
## age 0.014509 0.148178 0.098 0.92206
## addressU 1.105820 0.424932 2.602 0.00969 **
## famsizeLE3 0.496159 0.368521 1.346 0.17914
## PstatusT 0.768732 0.521541 1.474 0.14147
## Medu 0.317803 0.242360 1.311 0.19070
## Fedu 0.332139 0.207772 1.599 0.11090
## Mjobhealth 0.780292 0.864211 0.903 0.36726
## Mjobother -0.506167 0.567026 -0.893 0.37270
## Mjobservices 0.668814 0.633313 1.056 0.29174
## Mjobteacher -1.176189 0.801990 -1.467 0.14347
## Fjobhealth 1.128215 1.063670 1.061 0.28963
## Fjobother 1.261640 0.753360 1.675 0.09497 .
## Fjobservices 1.188711 0.781597 1.521 0.12928
## Fjobteacher 2.942862 0.982784 2.994 0.00296 **
## reasonhome 0.452118 0.428985 1.054 0.29271
## reasonother 0.561494 0.610001 0.920 0.35801
## reasonreputation 0.277318 0.442207 0.627 0.53102
## traveltime 0.506018 0.247660 2.043 0.04185 *
## studytime 0.895230 0.216552 4.134 4.55e-05 ***
## failures -0.619055 0.275930 -2.244 0.02554 *
## schoolsupyes -2.093990 0.499782 -4.190 3.61e-05 ***
## famsupyes -0.610801 0.366076 -1.669 0.09619 .
## paidyes -0.542762 0.361601 -1.501 0.13434
## activitiesyes 0.050217 0.345997 0.145 0.88469
## nurseryyes 0.120396 0.424898 0.283 0.77709
## higheryes 3.253194 0.820618 3.964 9.08e-05 ***
## internetyes 0.795038 0.480166 1.656 0.09875 .
## romanticyes -0.188022 0.368039 -0.511 0.60979
## famrel 0.464018 0.184239 2.519 0.01227 *
## freetime 0.389599 0.174863 2.228 0.02657 *
## goout -0.440626 0.179184 -2.459 0.01446 *
## Dalc -0.005198 0.249484 -0.021 0.98339
## Walc -0.031654 0.191219 -0.166 0.86863
## health -0.088728 0.120704 -0.735 0.46282
## absences -0.260683 0.158545 -1.644 0.10111
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.002 on 321 degrees of freedom
## Multiple R-squared: 0.9434, Adjusted R-squared: 0.937
## F-statistic: 148.6 on 36 and 321 DF, p-value: < 2.2e-16
** ADJUSTED R2 OF 0.937 !!!!!! Note that when the intercept is dropped, R-squared is computed about zero rather than the mean, so it is not comparable to the R-squared of the intercept model. **
stud_resid = rstudent(model2)
dat_resid <- data.frame(RESID = stud_resid, ABS = abs(stud_resid), PRED = model2$fitted.values)
dat_resid %>% arrange(-ABS) %>% filter(ABS > 2)
## RESID ABS PRED
## 1 2.907693 2.907693 10.921897
## 2 2.535449 2.535449 6.816460
## 3 2.531060 2.531060 9.895896
## 4 2.475705 2.475705 11.073511
## 5 2.449680 2.449680 11.011177
## 6 2.424527 2.424527 12.089761
## 7 -2.268277 2.268277 12.550590
## 8 2.233351 2.233351 11.676215
## 9 -2.215128 2.215128 11.334789
## 10 2.168795 2.168795 11.738572
## 11 2.100816 2.100816 7.033648
## 12 -2.080427 2.080427 12.806819
## 13 2.079162 2.079162 13.036056
## 14 -2.050184 2.050184 11.897140
## 15 -2.044874 2.044874 11.702064
nortest::ad.test(dat_resid$RESID)
##
## Anderson-Darling normality test
##
## data: dat_resid$RESID
## A = 0.33791, p-value = 0.503
p1 <- ggplot(data = dat_resid) + geom_histogram(aes(RESID), bins = 25) + labs(x = "Studentized Residuals")
p2 <- ggplot(data = dat_resid) + geom_point(aes(PRED, RESID)) + geom_hline(aes(yintercept = 0)) + labs(x = "Predicted Values", y = "Studentized Residuals")
cowplot::plot_grid(p1, p2)
set.seed(12345)
modelback <- step(model1, direction = "backward", trace = F)
modelback
##
## Call:
## lm(formula = G3 ~ sexM + age + addressU + Mjobhealth + Mjobservices +
## Fjobteacher + studytime + failures + schoolsupyes + famsupyes +
## goout + health + absences, data = student_d)
##
## Coefficients:
## (Intercept) sexM age addressU Mjobhealth
## 13.1670 0.6631 -0.2024 0.6295 1.7986
## Mjobservices Fjobteacher studytime failures schoolsupyes
## 1.4065 2.1208 0.5661 -0.9418 -2.3196
## famsupyes goout health absences
## -0.6310 -0.4774 -0.2617 -0.3240
modelboth <- step(model1, direction = "both", trace = F)
modelboth
##
## Call:
## lm(formula = G3 ~ sexM + age + addressU + Mjobhealth + Mjobservices +
## Fjobteacher + studytime + failures + schoolsupyes + famsupyes +
## goout + health + absences, data = student_d)
##
## Coefficients:
## (Intercept) sexM age addressU Mjobhealth
## 13.1670 0.6631 -0.2024 0.6295 1.7986
## Mjobservices Fjobteacher studytime failures schoolsupyes
## 1.4065 2.1208 0.5661 -0.9418 -2.3196
## famsupyes goout health absences
## -0.6310 -0.4774 -0.2617 -0.3240
modelforward <- step(model1, direction = "forward", trace = F)
modelforward
##
## Call:
## lm(formula = G3 ~ sexM + age + addressU + famsizeLE3 + PstatusT +
## Medu + Fedu + Mjobhealth + Mjobother + Mjobservices + Mjobteacher +
## Fjobhealth + Fjobother + Fjobservices + Fjobteacher + reasonhome +
## reasonother + reasonreputation + traveltime + studytime +
## failures + schoolsupyes + famsupyes + paidyes + activitiesyes +
## nurseryyes + higheryes + internetyes + romanticyes + famrel +
## freetime + goout + Dalc + Walc + health + absences, data = student_d)
##
## Coefficients:
## (Intercept) sexM age addressU
## 12.61184 0.57764 -0.18458 0.48945
## famsizeLE3 PstatusT Medu Fedu
## 0.24556 -0.24239 0.17054 0.13734
## Mjobhealth Mjobother Mjobservices Mjobteacher
## 1.14310 -0.58096 0.84323 -0.71277
## Fjobhealth Fjobother Fjobservices Fjobteacher
## -0.63368 -0.63043 -0.79711 1.18464
## reasonhome reasonother reasonreputation traveltime
## 0.26472 -0.09380 0.13815 0.02845
## studytime failures schoolsupyes famsupyes
## 0.51548 -0.82301 -2.42939 -0.63394
## paidyes activitiesyes nurseryyes higheryes
## -0.44278 0.12753 -0.27189 0.39340
## internetyes romanticyes famrel freetime
## 0.61575 -0.09634 0.12137 0.10646
## goout Dalc Walc health
## -0.45766 0.05727 -0.15349 -0.23289
## absences
## -0.36947
Backward and stepwise (“both”) selection give the same set of variables.
Forward selection retains all the variables: starting step() from the full model with direction = "forward" can only add terms, and there are none left to add.
Note that the stepwise procedures do not give exactly the same answer when started from the no-intercept model2 instead of model1.
summary(modelback)
##
## Call:
## lm(formula = G3 ~ sexM + age + addressU + Mjobhealth + Mjobservices +
## Fjobteacher + studytime + failures + schoolsupyes + famsupyes +
## goout + health + absences, data = student_d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.9070 -1.9329 -0.1166 1.8534 7.3467
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.1670 0.8580 15.347 < 2e-16 ***
## sexM 0.6631 0.3151 2.104 0.036096 *
## age -0.2024 0.1294 -1.564 0.118670
## addressU 0.6295 0.3612 1.743 0.082292 .
## Mjobhealth 1.7986 0.5323 3.379 0.000811 ***
## Mjobservices 1.4065 0.3444 4.084 5.51e-05 ***
## Fjobteacher 2.1208 0.5664 3.745 0.000212 ***
## studytime 0.5661 0.1880 3.012 0.002788 **
## failures -0.9418 0.2358 -3.994 7.94e-05 ***
## schoolsupyes -2.3196 0.4464 -5.197 3.49e-07 ***
## famsupyes -0.6310 0.3116 -2.025 0.043653 *
## goout -0.4774 0.1366 -3.494 0.000538 ***
## health -0.2617 0.1060 -2.468 0.014074 *
## absences -0.3240 0.1328 -2.440 0.015181 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.743 on 343 degrees of freedom
## Multiple R-squared: 0.3042, Adjusted R-squared: 0.2778
## F-statistic: 11.53 on 13 and 343 DF, p-value: < 2.2e-16
So far this model gives us the best adjusted R2 and the most significant variables.
stud_resid = rstudent(modelback)
dat_resid <- data.frame(RESID = stud_resid, ABS = abs(stud_resid), PRED = modelback$fitted.values)
dat_resid %>% arrange(-ABS) %>% filter(ABS > 2)
## RESID ABS PRED
## 1 2.728324 2.728324 11.653339
## 2 2.714544 2.714544 11.763080
## 3 2.594468 2.594468 10.103717
## 4 2.493234 2.493234 12.314119
## 5 2.350334 2.350334 11.859641
## 6 2.332082 2.332082 11.696637
## 7 2.254386 2.254386 11.999121
## 8 2.196549 2.196549 12.233189
## 9 -2.182972 2.182972 11.906954
## 10 -2.182586 2.182586 10.906406
## 11 2.128817 2.128817 12.355736
## 12 -2.100461 2.100461 11.622068
## 13 -2.052568 2.052568 11.514590
## 14 2.048496 2.048496 8.495680
## 15 2.035560 2.035560 7.549169
## 16 2.005830 2.005830 10.601775
nortest::ad.test(dat_resid$RESID)
##
## Anderson-Darling normality test
##
## data: dat_resid$RESID
## A = 0.55571, p-value = 0.1506
p1 <- ggplot(data = dat_resid) + geom_histogram(aes(RESID), bins = 25) + labs(x = "Studentized Residuals")
p2 <- ggplot(data = dat_resid) + geom_point(aes(PRED, RESID)) + geom_hline(aes(yintercept = 0)) + labs(x = "Predicted Values", y = "Studentized Residuals")
cowplot::plot_grid(p1, p2)
Of all possible interaction terms, several were significant. Among those, the most “interpretable” are studytime:schoolsupyes, studytime:famsupyes, studytime:goout, and schoolsupyes:goout. Interestingly, it appears that the effects of studytime are reduced when one goes out a lot; perhaps fatigue reduces the effectiveness of studying.
model.int <- lm(G3 ~ sexM + Mjobhealth + Mjobservices + Fjobteacher + studytime +
failures + schoolsupyes + famsupyes + paidyes + internetyes + goout + health +
absences + studytime:schoolsupyes + studytime:famsupyes + studytime:goout + schoolsupyes:goout,
data = student_d)
summary(model.int)
##
## Call:
## lm(formula = G3 ~ sexM + Mjobhealth + Mjobservices + Fjobteacher +
## studytime + failures + schoolsupyes + famsupyes + paidyes +
## internetyes + goout + health + absences + studytime:schoolsupyes +
## studytime:famsupyes + studytime:goout + schoolsupyes:goout,
## data = student_d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.3845 -1.7486 -0.0282 1.8463 7.2936
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.9712 1.3878 7.906 3.78e-14 ***
## sexM 0.5683 0.3133 1.814 0.070602 .
## Mjobhealth 1.8297 0.5264 3.476 0.000576 ***
## Mjobservices 1.4148 0.3406 4.154 4.14e-05 ***
## Fjobteacher 2.0923 0.5615 3.727 0.000227 ***
## studytime 1.4287 0.5851 2.442 0.015131 *
## failures -1.1644 0.2323 -5.012 8.70e-07 ***
## schoolsupyes 3.1224 1.6756 1.863 0.063259 .
## famsupyes -1.5353 0.8059 -1.905 0.057613 .
## paidyes -0.4169 0.3101 -1.344 0.179735
## internetyes 0.6524 0.4080 1.599 0.110768
## goout 0.2661 0.3562 0.747 0.455607
## health -0.2340 0.1052 -2.225 0.026749 *
## absences -0.3547 0.1313 -2.702 0.007231 **
## studytime:schoolsupyes -1.4887 0.5333 -2.791 0.005545 **
## studytime:famsupyes 0.5053 0.3740 1.351 0.177549
## studytime:goout -0.3400 0.1653 -2.057 0.040417 *
## schoolsupyes:goout -0.7082 0.3604 -1.965 0.050213 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.708 on 339 degrees of freedom
## Multiple R-squared: 0.3296, Adjusted R-squared: 0.296
## F-statistic: 9.804 on 17 and 339 DF, p-value: < 2.2e-16
stud_resid = rstudent(model.int)
dat_resid <- data.frame(RESID = stud_resid, ABS = abs(stud_resid), PRED = model.int$fitted.values)
dat_resid %>% arrange(-ABS) %>% filter(ABS > 2)
## RESID ABS PRED
## 1 2.767414 2.767414 11.70644
## 2 2.617816 2.617816 12.03399
## 3 2.550088 2.550088 12.30526
## 4 -2.450654 2.450654 11.38453
## 5 2.406855 2.406855 11.82766
## 6 2.371032 2.371032 10.73981
## 7 2.347761 2.347761 11.75319
## 8 2.327249 2.327249 11.88466
## 9 2.318470 2.318470 7.90480
## 10 -2.288029 2.288029 12.10269
## 11 -2.231194 2.231194 10.94818
## 12 -2.121865 2.121865 11.65760
## 13 2.037154 2.037154 12.63631
nortest::ad.test(dat_resid$RESID)
##
## Anderson-Darling normality test
##
## data: dat_resid$RESID
## A = 0.46134, p-value = 0.2578
The normality assumption is justified for model.int.
p1 <- ggplot(data = dat_resid) + geom_histogram(aes(RESID), bins = 25) + labs(x = "Studentized Residuals")
p2 <- ggplot(data = dat_resid) + geom_point(aes(PRED, RESID)) + geom_hline(aes(yintercept = 0)) + labs(x = "Predicted Values", y = "Studentized Residuals")
cowplot::plot_grid(p1, p2)
lm(formula = G3 ~ sexM + Mjobhealth + Mjobservices +
Fjobteacher + studytime + failures + schoolsupyes + famsupyes +
goout + health + absences, data = student_d) %>%
summary()
##
## Call:
## lm(formula = G3 ~ sexM + Mjobhealth + Mjobservices + Fjobteacher +
## studytime + failures + schoolsupyes + famsupyes + goout +
## health + absences, data = student_d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.8085 -1.9348 -0.1866 1.9893 7.2760
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.3782 0.7956 16.816 < 2e-16 ***
## sexM 0.6779 0.3164 2.143 0.032841 *
## Mjobhealth 1.9787 0.5307 3.729 0.000225 ***
## Mjobservices 1.5156 0.3436 4.411 1.38e-05 ***
## Fjobteacher 2.1484 0.5694 3.773 0.000190 ***
## studytime 0.5406 0.1888 2.863 0.004453 **
## failures -1.0778 0.2290 -4.706 3.66e-06 ***
## schoolsupyes -2.1153 0.4326 -4.890 1.55e-06 ***
## famsupyes -0.5913 0.3120 -1.895 0.058899 .
## goout -0.4831 0.1365 -3.539 0.000457 ***
## health -0.2596 0.1065 -2.437 0.015334 *
## absences -0.3581 0.1314 -2.725 0.006755 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.76 on 345 degrees of freedom
## Multiple R-squared: 0.2914, Adjusted R-squared: 0.2688
## F-statistic: 12.9 on 11 and 345 DF, p-value: < 2.2e-16
modelsmall <- lm(G3 ~ Mjobhealth + Mjobservices + Fjobteacher + failures + schoolsupyes + goout, data = student_d)
summary(modelsmall)
##
## Call:
## lm(formula = G3 ~ Mjobhealth + Mjobservices + Fjobteacher + failures +
## schoolsupyes + goout, data = student_d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.5735 -1.8793 -0.0725 1.9254 7.9254
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.0704 0.4698 27.819 < 2e-16 ***
## Mjobhealth 1.7801 0.5422 3.283 0.001130 **
## Mjobservices 1.3399 0.3522 3.804 0.000168 ***
## Fjobteacher 2.0188 0.5860 3.445 0.000640 ***
## failures -1.2676 0.2301 -5.508 7.05e-08 ***
## schoolsupyes -2.2548 0.4384 -5.143 4.51e-07 ***
## goout -0.4989 0.1405 -3.552 0.000435 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.855 on 350 degrees of freedom
## Multiple R-squared: 0.231, Adjusted R-squared: 0.2178
## F-statistic: 17.52 on 6 and 350 DF, p-value: < 2.2e-16
stud_resid_small = rstudent(modelsmall)
dat_resid_small <- data.frame(RESID = stud_resid_small, ABS = abs(stud_resid_small), PRED = modelsmall$fitted.values)
dat_resid_small %>% arrange(-ABS) %>% filter(ABS > 2)
## RESID ABS PRED
## 1 2.814574 2.814574 11.07460
## 2 2.630448 2.630448 11.57355
## 3 2.453589 2.453589 12.07249
## 4 -2.334139 2.334139 12.57143
## 5 -2.323367 2.323367 11.57355
## 6 2.281179 2.281179 10.57566
## 7 -2.277173 2.277173 12.41453
## 8 2.270601 2.270601 11.57355
## 9 2.270601 2.270601 11.57355
## 10 2.205028 2.205028 13.85261
## 11 2.166041 2.166041 11.91558
## 12 -2.146474 2.146474 12.07249
## 13 -2.146474 2.146474 12.07249
nortest::ad.test(dat_resid_small$RESID)
##
## Anderson-Darling normality test
##
## data: dat_resid_small$RESID
## A = 0.3526, p-value = 0.4646
The small model satisfies the conditions for the Gaussian error assumption (Anderson–Darling p-value = 0.4646).
p1.small <- ggplot(data = dat_resid_small) + geom_histogram(aes(RESID), bins = 25) + labs(x = "Studentized Residuals")
p2.small <- ggplot(data = dat_resid_small) + geom_point(aes(PRED, RESID)) + geom_hline(aes(yintercept = 0)) + labs(x = "Predicted Values", y = "Studentized Residuals")
cowplot::plot_grid(p1.small, p2.small)
The small model also appears to satisfy the diagnostic checks for homogeneity and linearity. Existence is trivial, but we should note the limitations of the independence assumption (for example, students may form study groups, or grades may be assigned on a curve).
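Beyond the visual residual checks, a formal test of homogeneity could complement the plots. A minimal sketch using the Breusch–Pagan test (assumes the `lmtest` package is installed; `modelsmall` is the reduced model fit above):

```r
# Breusch–Pagan test for heteroscedasticity on the small model;
# a large p-value is consistent with the homogeneity (constant variance) assumption
lmtest::bptest(modelsmall)
```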
model6 <- lm(formula = G3 ~ sexM + Mjobhealth + Mjobservices +
Fjobteacher + studytime + failures + schoolsupyes + famsupyes +
goout + health + absences, data = student_d)
summary(model6)
##
## Call:
## lm(formula = G3 ~ sexM + Mjobhealth + Mjobservices + Fjobteacher +
## studytime + failures + schoolsupyes + famsupyes + goout +
## health + absences, data = student_d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.8085 -1.9348 -0.1866 1.9893 7.2760
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.3782 0.7956 16.816 < 2e-16 ***
## sexM 0.6779 0.3164 2.143 0.032841 *
## Mjobhealth 1.9787 0.5307 3.729 0.000225 ***
## Mjobservices 1.5156 0.3436 4.411 1.38e-05 ***
## Fjobteacher 2.1484 0.5694 3.773 0.000190 ***
## studytime 0.5406 0.1888 2.863 0.004453 **
## failures -1.0778 0.2290 -4.706 3.66e-06 ***
## schoolsupyes -2.1153 0.4326 -4.890 1.55e-06 ***
## famsupyes -0.5913 0.3120 -1.895 0.058899 .
## goout -0.4831 0.1365 -3.539 0.000457 ***
## health -0.2596 0.1065 -2.437 0.015334 *
## absences -0.3581 0.1314 -2.725 0.006755 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.76 on 345 degrees of freedom
## Multiple R-squared: 0.2914, Adjusted R-squared: 0.2688
## F-statistic: 12.9 on 11 and 345 DF, p-value: < 2.2e-16
stud_resid = rstudent(model6)
dat_resid <- data.frame(RESID = stud_resid, ABS = abs(stud_resid), PRED = model6$fitted.values)
dat_resid %>% arrange(-ABS) %>% filter(ABS > 2)
## RESID ABS PRED
## 1 2.678210 2.678210 11.724004
## 2 2.669086 2.669086 11.777365
## 3 2.428470 2.428470 11.392824
## 4 2.421910 2.421910 12.449989
## 5 2.406835 2.406835 11.671522
## 6 2.300255 2.300255 10.796698
## 7 2.196107 2.196107 12.107675
## 8 -2.131205 2.131205 11.808485
## 9 2.115176 2.115176 12.240829
## 10 2.091643 2.091643 12.401337
## 11 2.061395 2.061395 8.423945
## 12 -2.039452 2.039452 10.565676
## 13 2.037242 2.037242 13.541178
## 14 2.033488 2.033488 12.575063
## 15 2.020778 2.020778 10.525226
nortest::ad.test(dat_resid$RESID)
##
## Anderson-Darling normality test
##
## data: dat_resid$RESID
## A = 0.74385, p-value = 0.05225
p1 <- ggplot(data = dat_resid) + geom_histogram(aes(RESID), bins = 25) + labs(x = "Studentized Residuals")
p2 <- ggplot(data = dat_resid) + geom_point(aes(PRED, RESID)) + geom_hline(aes(yintercept = 0)) + labs(x = "Predicted Values", y = "Studentized Residuals")
cowplot::plot_grid(p1, p2)
smallint <- lm(G3 ~ Mjobhealth + Mjobservices + Fjobteacher + studytime + failures + schoolsupyes + absences + studytime*schoolsupyes, data = student_d)
summary(smallint)
##
## Call:
## lm(formula = G3 ~ Mjobhealth + Mjobservices + Fjobteacher + studytime +
## failures + schoolsupyes + absences + studytime * schoolsupyes,
## data = student_d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.3169 -1.8571 -0.0307 1.9693 8.5644
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.1451 0.5269 21.153 < 2e-16 ***
## Mjobhealth 1.5611 0.5367 2.909 0.003861 **
## Mjobservices 1.3264 0.3497 3.793 0.000176 ***
## Fjobteacher 2.0558 0.5830 3.526 0.000478 ***
## studytime 0.5550 0.1959 2.833 0.004884 **
## failures -1.1399 0.2328 -4.896 1.5e-06 ***
## schoolsupyes 0.1176 1.2027 0.098 0.922180
## absences -0.4215 0.1331 -3.168 0.001672 **
## studytime:schoolsupyes -1.1585 0.5398 -2.146 0.032571 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.83 on 348 degrees of freedom
## Multiple R-squared: 0.2484, Adjusted R-squared: 0.2312
## F-statistic: 14.38 on 8 and 348 DF, p-value: < 2.2e-16
stud_resid = rstudent(smallint)
dat_resid <- data.frame(RESID = stud_resid, ABS = abs(stud_resid), PRED = smallint$fitted.values)
dat_resid %>% arrange(-ABS) %>% filter(ABS > 2)
## RESID ABS PRED
## 1 3.084691 3.084691 10.43556
## 2 2.558145 2.558145 10.85706
## 3 2.515938 2.515938 11.96703
## 4 2.363309 2.363309 12.38852
## 5 2.362362 2.362362 11.47392
## 6 2.273524 2.273524 11.83483
## 7 -2.262882 2.262882 12.31691
## 8 2.213032 2.213032 12.91281
## 9 2.143890 2.143890 10.99055
## 10 -2.137058 2.137058 10.99055
## 11 -2.105066 2.105066 10.72103
## 12 2.088324 2.088324 13.15990
## 13 -2.077790 2.077790 11.83354
## 14 -2.077790 2.077790 11.83354
## 15 -2.077790 2.077790 11.83354
## 16 -2.077790 2.077790 11.83354
## 17 2.001355 2.001355 12.38852
## 18 2.000896 2.000896 14.50464
nortest::ad.test(dat_resid$RESID)
##
## Anderson-Darling normality test
##
## data: dat_resid$RESID
## A = 0.53226, p-value = 0.1725
p1 <- ggplot(data = dat_resid) + geom_histogram(aes(RESID), bins = 25) + labs(x = "Studentized Residuals")
p2 <- ggplot(data = dat_resid) + geom_point(aes(PRED, RESID)) + geom_hline(aes(yintercept = 0)) + labs(x = "Predicted Values", y = "Studentized Residuals")
cowplot::plot_grid(p1, p2)
model6_all <- lm(G3 ~ Mjobhealth + Mjobservices + Fjobteacher + failures + schoolsupyes + goout + health + absences, data = student_d_all)
summary(model6_all)
##
## Call:
## lm(formula = G3 ~ Mjobhealth + Mjobservices + Fjobteacher + failures +
## schoolsupyes + goout + health + absences, data = student_d_all)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.1619 -1.7970 0.4408 2.7480 8.4881
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.1885 0.8886 13.716 < 2e-16 ***
## Mjobhealth 2.2767 0.7659 2.972 0.00314 **
## Mjobservices 1.5218 0.4909 3.100 0.00208 **
## Fjobteacher 1.4156 0.8087 1.751 0.08083 .
## failures -2.1902 0.2891 -7.576 2.66e-13 ***
## schoolsupyes -1.2548 0.6285 -1.996 0.04659 *
## goout -0.4418 0.1910 -2.313 0.02123 *
## health -0.1973 0.1523 -1.295 0.19609
## absences 0.2879 0.1885 1.527 0.12755
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.164 on 386 degrees of freedom
## Multiple R-squared: 0.1907, Adjusted R-squared: 0.1739
## F-statistic: 11.37 on 8 and 386 DF, p-value: 1.699e-14
Influence – Cook’s distance: values near or above 1 flag influential observations.
Outliers – leverage (hat) values: values greater than 2p/n flag high-leverage observations.
p <- modelback$coefficients %>% length()  # number of model parameters
n <- nrow(student_d)                      # sample size
levcutoff <- 2 * p / n                    # leverage cutoff: 2p/n
lev <- hatvalues(modelback)               # leverage (hat) values
cook <- cooks.distance(modelback)         # Cook's distances
Next, we consider removing the flagged outliers and rerunning the regression.
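A minimal sketch of that refit, reusing the `lev`, `cook`, and `levcutoff` objects computed above and the flagging rules just stated (`modelback` and `student_d` come from the preceding analysis):

```r
# Flag observations by the rules above: Cook's distance near or above 1
# (influence) and leverage above 2p/n (potential outliers)
influential <- which(cook > 1)
high_lev <- which(lev > levcutoff)
flagged <- union(influential, high_lev)

# Refit the stepwise model without the flagged rows and inspect the new fit
if (length(flagged) > 0) {
  modelback_trim <- update(modelback, data = student_d[-flagged, ])
  summary(modelback_trim)
}
```

The `if` guard matters: in R, subsetting with `-integer(0)` drops every row rather than none, so refitting only when some observations are actually flagged avoids an empty-data model.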
We also decided to informally remove a few outliers to improve the accuracy of our analysis. For example, we noticed that there were five students in the 20-22 age range. Since ages 15-18 had 70+ students per age, and age 19 had 19 students, we decided to remove the 20-22 year old students. For students of this age to still be in secondary school, it is likely that they had extenuating circumstances not represented in the dataset. Moreover, we simply have too few observations in this age range to accurately analyze the effect of age on the outcome in the 20+ region. We therefore removed these students.
table(student$age)
##
## 0 1 2 3 4 5 6 7
## 76 97 90 70 19 3 1 1
student <- student %>%
filter(age < 20 - 15)  # age is coded as (age - 15), so this keeps students under 20
Aside from looking at the adjusted R², we can also compare models using the AIC or the SBC/BIC (smaller is better in both cases). More or different models could be added to this comparison, but here is the basis:
aic <- c(AIC(model1), AIC(model2), AIC(modelback), AIC(model.int), AIC(modelsmall), AIC(model6), AIC(smallint))
bic <- c(BIC(model1), BIC(model2), BIC(modelback), BIC(model.int), BIC(modelsmall), BIC(model6), BIC(smallint))
adjR <- c( summary(model1)$adj.r.sq, summary(model2)$adj.r.sq, summary(modelback)$adj.r.sq, summary(model.int)$adj.r.sq, summary(modelsmall)$adj.r.sq, summary(model6)$adj.r.sq, summary(smallint)$adj.r.sq)
selection <- data.frame(model = c("Model 1: Full Model", "Model 2: No Intercept", "Model 3: Stepwise Model", "Model 4: Stepwise Model + Interaction Terms", "Model 5: Reduced Stepwise Model", "Model 6: Semi-Reduced Stepwise Model", "Model 7: Reduced Interaction Model"), AICs = aic, BICs = bic, Adjusted.R.Squared = adjR)
selection %>% knitr::kable(align = c("c", "c"))
| model | AICs | BICs | Adjusted.R.Squared |
|---|---|---|---|
| Model 1: Full Model | 1782.011 | 1929.365 | 0.2542165 |
| Model 2: No Intercept | 1834.145 | 1977.621 | 0.9370466 |
| Model 3: Stepwise Model | 1749.316 | 1807.482 | 0.2778046 |
| Model 4: Stepwise Model + Interaction Terms | 1744.031 | 1817.708 | 0.2959758 |
| Model 5: Reduced Stepwise Model | 1771.009 | 1802.031 | 0.2178292 |
| Model 6: Semi-Reduced Stepwise Model | 1751.814 | 1802.225 | 0.2688021 |
| Model 7: Reduced Interaction Model | 1766.823 | 1805.600 | 0.2311681 |
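Note that Model 2’s adjusted R² (0.937) is not comparable to the others: when the intercept is dropped, R computes R² against an uncentered total sum of squares, which inflates it. This is one reason to lean on AIC and BIC here. As a sketch of what these criteria measure, both can be recomputed from the log-likelihood (illustrated on a built-in dataset; `fit` is an arbitrary example model, not one from this analysis):

```r
# AIC and BIC differ only in the per-parameter penalty:
#   AIC = -2*logLik + 2*k,   BIC = -2*logLik + log(n)*k
fit <- lm(mpg ~ wt + hp, data = mtcars)  # illustrative model
k <- attr(logLik(fit), "df")             # parameter count, incl. the error variance
n <- nobs(fit)
c(aic = -2 * as.numeric(logLik(fit)) + 2 * k,       # matches AIC(fit)
  bic = -2 * as.numeric(logLik(fit)) + log(n) * k)  # matches BIC(fit)
```

Because log(n) exceeds 2 once n ≥ 8, BIC penalizes extra terms more heavily than AIC, which is why the smaller Models 5-7 fare relatively better under BIC than under AIC in the table above.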
[1] P. Cortez and A. Silva. Using Data Mining to Predict Secondary School Student Performance. In A. Brito and J. Teixeira Eds., Proceedings of 5th Future Business Technology Conference (FUBUTEC 2008) pp. 5-12, Porto, Portugal, April, 2008, EUROSIS, ISBN 978-9077381-39-7.